Local Relation Networks for Image Recognition Paper Reading

Local Relation Networks for Image Recognition

code
Left: the Local Relation (LR) layer; right: Non-Local.


Geometry Prior

A learnable positional prior. The geometry prior for the $k \times k$ region centered at a point is obtained by passing the relative positions through two $1 \times 1$ convolutions with a ReLU in between.
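A minimal sketch of such a module, assuming the input to the two $1 \times 1$ convolutions is the fixed grid of relative $(dy, dx)$ coordinates of the $k \times k$ neighborhood (the hidden width `mid` is an assumption, not from the post):

```python
import torch

class GeometryPrior(torch.nn.Module):
    """Sketch: two 1x1 convs with a ReLU, applied to the k*k relative positions."""
    def __init__(self, k, channels, mid=32):  # `mid` width is an assumption
        super().__init__()
        # Fixed relative (dy, dx) coordinates of the k x k neighborhood in [-1, 1],
        # treated as a 2-channel "image" of size k x k.
        ys, xs = torch.meshgrid(torch.linspace(-1.0, 1.0, k),
                                torch.linspace(-1.0, 1.0, k), indexing='ij')
        self.register_buffer('position', torch.stack([ys, xs]).unsqueeze(0))  # [1, 2, k, k]
        self.l1 = torch.nn.Conv2d(2, mid, 1)
        self.l2 = torch.nn.Conv2d(mid, channels, 1)

    def forward(self):
        # One learned prior value per channel and per offset: [1, channels, k, k]
        return self.l2(torch.relu(self.l1(self.position)))
```

Because the prior depends only on the relative offsets, the forward pass takes no input and the result is shared across all spatial locations.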

Appearance Composability

The input feature map passes through two $1 \times 1$ convolutions to produce the key and query maps. The value is the identity of the input, with no transform applied. The query at a point $(x, y)$ is multiplied with each of the $k \times k$ keys in its neighborhood, giving a scalar score for how related the center is to each position in the region. This appearance term is added to the geometry prior and fed through a softmax to obtain the aggregation weights. The weights are then used to take a weighted sum over the values in the $k \times k$ region, producing a scalar that becomes the output value at the region's center. Finally, a $1 \times 1$ convolution produces the output feature map.
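The key/query projection and the query-key product can be sketched as follows (a sketch, not the reference implementation; taking the query at the window center assumes an odd $k$ with matching padding):

```python
import torch

class KeyQueryMap(torch.nn.Module):
    """1x1 conv projecting C channels down to C/m; used for both key and query."""
    def __init__(self, channels, m):
        super().__init__()
        self.l = torch.nn.Conv2d(channels, channels // m, 1)

    def forward(self, x):
        return self.l(x)

class AppearanceComposability(torch.nn.Module):
    """Sketch: multiply the query at each window center with its k*k keys."""
    def __init__(self, k, padding, stride):
        super().__init__()
        self.k = k
        self.unfold = torch.nn.Unfold(k, 1, padding, stride)

    def forward(self, x):
        key_map, query_map = x                        # each [N, C/m, h, w]
        N, c = key_map.shape[:2]
        kk = self.k * self.k
        # Unfold keys: [N, c*k*k, L] -> [N, c, k*k, L] -> [N, c, L, k, k]
        k_unf = self.unfold(key_map).view(N, c, kk, -1)
        k_unf = k_unf.permute(0, 1, 3, 2).reshape(N, c, -1, self.k, self.k)
        # Query at each window center (index k*k // 2, valid for odd k)
        q_unf = self.unfold(query_map).view(N, c, kk, -1)
        q_center = q_unf[:, :, kk // 2, :].unsqueeze(-1).unsqueeze(-1)  # [N, c, L, 1, 1]
        return k_unf * q_center                       # [N, c, L, k, k]
```

The output holds, for every window, one relatedness score per channel and per neighborhood offset, matching the shape the layer below expects.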

```python
import torch

# KeyQueryMap, AppearanceComposability, GeometryPrior and combine_prior
# are helper modules of the implementation, not shown here.
class LocalRelationalLayer(torch.nn.Module):
    def __init__(self, channels, k, stride=1, padding=0, m=None):
        super(LocalRelationalLayer, self).__init__()
        self.channels = channels
        self.k = k
        self.stride = stride
        self.m = m or 8
        self.padding = padding
        self.kmap = KeyQueryMap(channels, self.m)
        self.qmap = KeyQueryMap(channels, self.m)
        self.ac = AppearanceComposability(k, self.padding, self.stride)
        self.gp = GeometryPrior(k, channels // self.m)
        self.unfold = torch.nn.Unfold(k, 1, self.padding, self.stride)
        self.final1x1 = torch.nn.Conv2d(channels, channels, 1)

    def forward(self, x):                  # x: [N, C, H, W]
        km = self.kmap(x)                  # [N, C/m, h, w]
        qm = self.qmap(x)                  # [N, C/m, h, w]
        ak = self.ac((km, qm))             # [N, C/m, H_out*W_out, k, k]
        gpk = self.gp()                    # [1, C/m, k, k]
        # Aggregation weights, broadcast over the m channel groups.
        ck = combine_prior(ak, gpk.unsqueeze(2))[:, None, :, :, :]  # [N, 1, C/m, H_out*W_out, k, k]
        x_unfold = self.unfold(x).transpose(2, 1).contiguous().view(
            x.shape[0], -1, x.shape[1], self.k * self.k).transpose(2, 1).contiguous()
        x_unfold = x_unfold.view(x.shape[0], self.m, x.shape[1] // self.m,
                                 -1, self.k, self.k)  # [N, m, C/m, H_out*W_out, k, k]
        pre_output = (ck * x_unfold).view(x.shape[0], x.shape[1],
                                          -1, self.k * self.k)  # [N, C, H_out*W_out, k*k]
        h_out = (x.shape[2] + 2 * self.padding - self.k) // self.stride + 1
        w_out = (x.shape[3] + 2 * self.padding - self.k) // self.stride + 1
        pre_output = torch.sum(pre_output, 3).view(
            x.shape[0], x.shape[1], h_out, w_out)  # [N, C, H_out, W_out]
        return self.final1x1(pre_output)
```

Unfold

Unfold extracts sliding windows from a feature map according to `kernel_size` and `stride`. For an input of shape (2, 64, 19, 19), `torch.nn.Unfold(k, 1, 0, 1)` extracts, for each window, the $k \times k$ patch across all channels ($k \times k \times C$ values) and flattens it into the channel dimension, so one window corresponds to $64 \times k \times k$ values. There are $[(H - k + 2p) // stride + 1] \times [(W - k + 2p) // stride + 1]$ windows in total. With $k = 7$ this gives $64 \times 7 \times 7 = 3136$ values per window and $13 \times 13 = 169$ windows, so the output shape is [2, 3136, 169].
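The shape arithmetic above can be checked directly:

```python
import torch

x = torch.randn(2, 64, 19, 19)  # (N, C, H, W)
unfold = torch.nn.Unfold(kernel_size=7, dilation=1, padding=0, stride=1)
out = unfold(x)
# Each window packs C * k * k = 64 * 7 * 7 = 3136 values;
# number of windows = ((19 - 7) // 1 + 1) ** 2 = 13 * 13 = 169.
print(out.shape)  # torch.Size([2, 3136, 169])
```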